Me: Chris Hausler, Snr Data Engineer @ Zendesk

Now: Pandas

Python Data Analysis Library (pandas)

  • open source
  • BSD-licensed library
  • high-performance
  • easy-to-use... sort of

Creates something similar to R's data frames... but better

I think it's great, but I'm still a bit clumsy with it... also the docs are still a little hit and miss

Some imports


In [8]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)

In [17]:
import numpy as np
import pandas as pd
import pylab as plt
import matplotlib 
import seaborn as sns
%matplotlib inline
pd.__version__


Out[17]:
'0.14.1'

pandas has two main data structures: Series and DataFrame

Series - like a one-dimensional array, but better


In [10]:
values = [5,3,4,8,2,9]
vals = pd.Series(values)
vals


Out[10]:
0    5
1    3
2    4
3    8
4    2
5    9
dtype: int64

Each value is now associated with an index.

The index itself is an object of class Index and can be manipulated directly.


In [11]:
vals.index


Out[11]:
Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')

In [12]:
vals.values


Out[12]:
array([5, 3, 4, 8, 2, 9])

In [13]:
vals * 2.5


Out[13]:
0    12.5
1     7.5
2    10.0
3    20.0
4     5.0
5    22.5
dtype: float64

We can use text-based indexes too


In [14]:
vals2 = pd.Series(values, index=['tom','sally','jeff','george','pablo','florence'])
vals2


Out[14]:
tom         5
sally       3
jeff        4
george      8
pablo       2
florence    9
dtype: int64

And use these to get the data we want


In [18]:
vals2[['florence','tom']]


Out[18]:
florence    9
tom         5
dtype: int64

In [19]:
vals2[['florence','tom','kate']]


Out[19]:
florence     9
tom          5
kate       NaN
dtype: float64

Dealing with missing values


In [20]:
vals3 = vals2[['tom','sally','pablo','florence','ricky','katrin']]
vals3


Out[20]:
tom          5
sally        3
pablo        2
florence     9
ricky      NaN
katrin     NaN
dtype: float64

1. Get rid of them


In [22]:
vals3.dropna()


Out[22]:
tom         5
sally       3
pablo       2
florence    9
dtype: float64

2. Fill them with something


In [23]:
vals3.fillna(0)


Out[23]:
tom         5
sally       3
pablo       2
florence    9
ricky       0
katrin      0
dtype: float64

3. Fill them with a calculated value


In [24]:
vals3.fillna(vals3.mean())


Out[24]:
tom         5.00
sally       3.00
pablo       2.00
florence    9.00
ricky       4.75
katrin      4.75
dtype: float64

4. Use a function like forward fill


In [25]:
vals3.fillna(method='ffill')


Out[25]:
tom         5
sally       3
pablo       2
florence    9
ricky       9
katrin      9
dtype: float64

We also have a handy way to get a picture of our data


In [26]:
vals3.describe()


Out[26]:
count    4.000000
mean     4.750000
std      3.095696
min      2.000000
25%      2.750000
50%      4.000000
75%      6.000000
max      9.000000
dtype: float64

DataFrame

Like a 2D array... with bells and whistles


In [28]:
# give vals the same text index, and swap 'ricky' for 'billy' in vals3
# so the two Series only partially overlap
vals.index = pd.Index(['tom','sally','pablo','florence','ricky','katrin'])
vals3 = vals3[['tom','sally','pablo','florence','billy','katrin']]

In [29]:
# create a dataframe
dat = pd.DataFrame({'orig':vals,'new':vals3})
dat


Out[29]:
          new  orig
billy     NaN   NaN
florence    9     8
katrin    NaN     9
pablo       2     4
ricky     NaN     2
sally       3     3
tom         5     5

Check for nulls


In [30]:
dat.isnull()


Out[30]:
            new   orig
billy      True   True
florence  False  False
katrin     True  False
pablo     False  False
ricky      True  False
sally     False  False
tom       False  False

Drop rows with nulls


In [31]:
dat.dropna()


Out[31]:
          new  orig
florence    9     8
pablo       2     4
sally       3     3
tom         5     5

Time series with pandas DataFrames

... a winning combination

Data from Google Trends... what correlates (positively and negatively) with the search term 'hipster'?

Read the hipster correlations from a CSV file

Pandas supports many file formats for read and write, including:

  • csv
  • json
  • pickle
  • the clipboard
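
For example, a round trip through JSON and pickle looks like this (a minimal sketch using the dat frame from above; the file names are made up):


In [ ]:
# hypothetical file names, just for illustration
dat.to_json('dat.json')
dat_from_json = pd.read_json('dat.json')

dat.to_pickle('dat.pkl')
dat_from_pickle = pd.read_pickle('dat.pkl')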

In [33]:
hipster = pd.read_csv('hipster.csv')
hipster[:10]


Out[33]:
       Date  hipster  modcloth  gumtree perth
0   4/01/04   -0.976    -0.817         -0.844
1  11/01/04   -0.816    -0.817         -0.844
2  18/01/04   -0.837    -0.817         -0.844
3  25/01/04   -0.976    -0.817         -0.844
4   1/02/04   -0.722    -0.817         -0.844
5   8/02/04   -0.795    -0.817         -0.844
6  15/02/04   -0.723    -0.817         -0.844
7  22/02/04   -0.713    -0.817         -0.844
8  29/02/04   -0.786    -0.817         -0.844
9   7/03/04   -0.675    -0.817         -0.844

Set the index to a datetime


In [34]:
hipster = hipster.set_index(pd.DatetimeIndex(hipster.pop('Date')))
hipster[:10]


Out[34]:
            hipster  modcloth  gumtree perth
2004-04-01   -0.976    -0.817         -0.844
2004-11-01   -0.816    -0.817         -0.844
2004-01-18   -0.837    -0.817         -0.844
2004-01-25   -0.976    -0.817         -0.844
2004-01-02   -0.722    -0.817         -0.844
2004-08-02   -0.795    -0.817         -0.844
2004-02-15   -0.723    -0.817         -0.844
2004-02-22   -0.713    -0.817         -0.844
2004-02-29   -0.786    -0.817         -0.844
2004-07-03   -0.675    -0.817         -0.844
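
One caveat: ambiguous strings like 4/01/04 get parsed month-first (hence 2004-04-01 above), while 18/01/04 can only be day-first, so this index isn't in true chronological order. A sketch that re-reads the file, parses day-first explicitly, and sorts, assuming the CSV dates really are day/month/year:


In [ ]:
hipster = pd.read_csv('hipster.csv')
# dayfirst=True makes 4/01/04 parse as 2004-01-04
dates = pd.to_datetime(hipster.pop('Date'), dayfirst=True)
hipster = hipster.set_index(pd.DatetimeIndex(dates)).sort_index()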

Now load the anti-Hipster data


In [36]:
not_hipster = pd.read_csv('negative_hipster.csv')
not_hipster = not_hipster.set_index(pd.DatetimeIndex(not_hipster.pop('Date')))

In [37]:
not_hipster[:10]


Out[37]:
            yellow pages  babelfish  sony ericsson  melway
2004-04-01         1.341      0.090          0.695   0.983
2004-11-01         1.239      0.256          0.846   0.714
2004-01-18         1.022     -0.023          0.558   0.694
2004-01-25         0.923      0.188          0.645   1.104
2004-01-02         0.904     -0.009          0.513   0.767
2004-08-02         0.786      0.338          0.694   1.131
2004-02-15         0.729      0.360          0.695   0.985
2004-02-22         0.537     -0.070          0.613   0.859
2004-02-29         0.534      0.273          0.447   1.286
2004-07-03         0.229      0.595          0.898   0.810

Check the values of one column


In [38]:
hipster.hipster.head()


Out[38]:
2004-04-01   -0.976
2004-11-01   -0.816
2004-01-18   -0.837
2004-01-25   -0.976
2004-01-02   -0.722
Name: hipster, dtype: float64

Check another, but get the values as a numpy.ndarray


In [39]:
hipster['gumtree perth'].values[:20]


Out[39]:
array([-0.844, -0.844, -0.844, -0.844, -0.844, -0.844, -0.844, -0.844,
       -0.844, -0.844, -0.844, -0.844, -0.844, -0.844, -0.844, -0.844,
       -0.844, -0.844, -0.844, -0.844])

View the data types; they don't need to be homogeneous


In [40]:
hipster.dtypes


Out[40]:
hipster          float64
modcloth         float64
gumtree perth    float64
dtype: object

Joins on indexes are easy!


In [41]:
trend = hipster.join(not_hipster, how='inner')
trend.head()


Out[41]:
            hipster  modcloth  gumtree perth  yellow pages  babelfish  sony ericsson  melway
2004-04-01   -0.976    -0.817         -0.844         1.341      0.090          0.695   0.983
2004-11-01   -0.816    -0.817         -0.844         1.239      0.256          0.846   0.714
2004-01-18   -0.837    -0.817         -0.844         1.022     -0.023          0.558   0.694
2004-01-25   -0.976    -0.817         -0.844         0.923      0.188          0.645   1.104
2004-01-02   -0.722    -0.817         -0.844         0.904     -0.009          0.513   0.767
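
join defaults to how='left', keeping every row of the left frame; an outer join would keep dates present in either frame (a sketch with the same join API):


In [ ]:
trend_outer = hipster.join(not_hipster, how='outer')
trend_outer.shape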

We can check the column names and values


In [42]:
trend.columns


Out[42]:
Index([u'hipster', u'modcloth', u'gumtree perth', u'yellow pages', u'babelfish', u'sony ericsson', u'melway'], dtype='object')

In [43]:
trend.values


Out[43]:
array([[-0.976, -0.817, -0.844, ...,  0.09 ,  0.695,  0.983],
       [-0.816, -0.817, -0.844, ...,  0.256,  0.846,  0.714],
       [-0.837, -0.817, -0.844, ..., -0.023,  0.558,  0.694],
       ..., 
       [ 1.142,  1.175,  1.394, ..., -1.692, -1.707, -1.171],
       [ 1.187,  1.221,  1.403, ..., -1.654, -1.707, -1.16 ],
       [ 1.514,  1.216,  1.365, ..., -1.664, -1.697, -1.171]])

Filtering on date ranges is simple


In [44]:
trend['2012-01-01':].head()


Out[44]:
            hipster  modcloth  gumtree perth  yellow pages  babelfish  sony ericsson  melway
2012-01-01    1.411     1.192          1.774        -1.077     -1.132         -0.819  -1.058
2012-08-01    1.513     1.111          1.579        -0.995     -1.128         -0.889  -1.069
2012-01-15    1.523     1.427          1.613        -1.027     -1.182         -0.991  -1.074
2012-01-22    1.600     1.490          1.514        -1.140     -1.197         -0.975  -1.106
2012-01-29    1.459     1.561          1.511        -1.046     -1.189         -1.027  -1.079

In [45]:
trend['2012-01-01': '2013-01-01'].tail(3)


Out[45]:
            hipster  modcloth  gumtree perth  yellow pages  babelfish  sony ericsson  melway
2012-12-16    1.645     1.175          1.407        -1.433     -1.523         -1.545  -1.192
2012-12-23    1.591     1.695          1.625        -1.698     -1.540         -1.499  -1.207
2012-12-30    1.596     1.515          1.868        -1.515     -1.531         -1.535  -1.207

We can also grab a single date, or a subset of columns


In [46]:
trend.ix['2012-01-01', ['hipster', 'modcloth']]


Out[46]:
            hipster  modcloth
2012-01-01    1.411     1.192
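
As an aside, .ix was deprecated in later pandas releases; .loc does the same label-based lookup and is the safer habit (a sketch):


In [ ]:
trend.loc['2012-01-01', ['hipster', 'modcloth']]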

Or do some boolean filtering


In [48]:
trend[trend.melway < 0].head()


Out[48]:
            hipster  modcloth  gumtree perth  yellow pages  babelfish  sony ericsson  melway
2008-02-11   -0.613    -0.759         -0.675         0.342      0.402          0.487  -0.054
2008-12-21   -0.626    -0.700         -0.650         0.103      0.305          0.973  -0.082
2008-12-28   -0.721    -0.749         -0.617         0.371      0.160          0.744  -0.193
2009-04-01   -0.713    -0.726         -0.602         0.815      0.256          0.618  -0.070
2009-11-01   -0.706    -0.745         -0.602         0.771      0.058          0.596  -0.034
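
Conditions combine with the element-wise & and | operators, with parentheses around each clause, e.g.:


In [ ]:
# weeks where melway trends down while hipster trends up
trend[(trend.melway < 0) & (trend.hipster > 0)].head()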

Plotting is built in, and handles dates more gracefully than raw matplotlib


In [49]:
_ = trend.plot(figsize=(10, 6))
_ = plt.legend(loc='best', ncol=2)


We can also do it for a single column


In [50]:
_ = trend.hipster.cumsum().plot()


Or split the columns out to subplots


In [60]:
axs = trend.plot(subplots=True, figsize=(10, 10))


Resampling data is also straightforward.


In [52]:
# resample by month
trend.resample('M', how='mean').head()


Out[52]:
             hipster  modcloth  gumtree perth  yellow pages  babelfish  sony ericsson  melway
2004-01-31  -0.84550    -0.817         -0.844        0.7915    0.16075        0.60925  0.8645
2004-02-29  -0.76725    -0.817         -0.844        0.5880    0.23400        0.60125  0.9355
2004-03-31  -0.81675    -0.817         -0.844        0.4460    0.40725        0.78550  0.8470
2004-04-30  -0.80260    -0.817         -0.844        0.6992    0.44000        0.62420  0.9614
2004-05-31  -0.77000    -0.817         -0.844        0.4052    0.53860        1.07800  0.7422

but one can also resample by business day, week, month, quarter, year and a bunch of other rules


In [67]:
trend.resample('M', how='mean').hipster.dropna().plot()


Out[67]:
<matplotlib.axes._subplots.AxesSubplot at 0x109bc9950>
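
For instance, a quarterly mean is just a different rule string (a sketch using the same resample API as above):


In [ ]:
trend.resample('Q', how='mean').head()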

Other fancy plots include a scatter matrix with kernel density estimates (KDE) on the diagonal


In [68]:
# look at the relations
_ = pd.scatter_matrix(trend, figsize=(12,8), diagonal='kde')


Titanic: Machine Learning from Disaster (kaggle.com)

Load the data, explore it and learn from it


In [69]:
df = pd.read_csv('train.csv', header=0)

In [70]:
df.head()


Out[70]:
   PassengerId  Survived  Pclass                                               Name     Sex  Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male   22      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female   38      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female   26      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female   35      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male   35      0      0            373450   8.0500   NaN        S

Let's look at the data types here (this time they're heterogeneous)


In [71]:
df.dtypes


Out[71]:
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

We can also get a more verbose summary


In [72]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)

DataFrames can be grouped, like in SQL (it sucked to be a young male on the Titanic)


In [73]:
df_grouped = df.groupby(['Pclass', 'Sex'])

In [74]:
df_grouped[['Age', 'Survived']].mean()


Out[74]:
                     Age  Survived
Pclass Sex
1      female  34.611765  0.968085
       male    41.281386  0.368852
2      female  28.722973  0.921053
       male    30.740707  0.157407
3      female  21.750000  0.500000
       male    26.507589  0.135447
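
Groups aren't limited to a single statistic; .agg takes a list of functions (a sketch):


In [ ]:
df_grouped['Age'].agg(['mean', 'median', 'count'])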

Histograms are straightforward


In [75]:
ax = df['Age'].dropna().hist(bins=20, range=(0,100), alpha = .5)
ax.set_xlabel('Age')
ax.set_ylabel('Passenger Count')


Out[75]:
<matplotlib.text.Text at 0x11019a710>

So are boxplots


In [101]:
bp = df.boxplot(column='Age', by='Pclass', grid=False)
# for i in set(df.Pclass):
#     y = df.Age[df.Pclass==i].dropna()
#     # Add some random "jitter" to the x-axis
#     x = np.random.normal(i, 0.04, size=len(y))
#     plt.plot(x, y, 'r.', alpha=0.2)


If we want to do some learning on this data... let's convert gender to a binary numeric


In [77]:
df['isFemale'] = df['Sex'].map( {'female': 1, 'male': 0} ).astype(int)
df[['Sex','isFemale']].head()


Out[77]:
      Sex  isFemale
0    male         0
1  female         1
2  female         1
3  female         1
4    male         0

Find non-numeric columns so we can drop them later


In [78]:
drop_cols = df.columns[df.dtypes.map(lambda x: x=='object')]
drop_cols


Out[78]:
Index([u'Name', u'Sex', u'Ticket', u'Cabin', u'Embarked'], dtype='object')

In [79]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 13 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
isFemale       891 non-null int64
dtypes: float64(2), int64(6), object(5)

Set up our data to learn from


In [80]:
X = pd.DataFrame(df[[c for c in df.columns if c != 'Survived']])
X = X.drop(drop_cols, axis=1) 
X = X.drop('PassengerId', axis=1)
y = df.Survived
print X.head()


   Pclass  Age  SibSp  Parch     Fare  isFemale
0       3   22      1      0   7.2500         0
1       1   38      1      0  71.2833         1
2       3   26      0      0   7.9250         1
3       1   35      1      0  53.1000         1
4       3   35      0      0   8.0500         0

Have a quick look at the class distribution


In [81]:
y.groupby(y.values).count()


Out[81]:
0    549
1    342
Name: Survived, dtype: int64
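
The same class balance comes back from a single call to value_counts:


In [ ]:
y.value_counts()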

and fill in some NaNs for age


In [82]:
X['Age'] = X.Age.fillna(X.Age.median())
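
A fancier option (a sketch, not what's applied above) is to fill with the median age of each passenger class rather than the global median; age_filled is a hypothetical name:


In [ ]:
# class-wise median fill via groupby + transform
age_filled = df.groupby('Pclass')['Age'].transform(lambda s: s.fillna(s.median()))
age_filled.head()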

scikit-learn

Machine Learning in Python

  • Simple and efficient tools for data mining and data analysis
  • Accessible to everybody, and reusable in various contexts
  • Built on NumPy, SciPy, and matplotlib
  • Open source, commercially usable - BSD license

What is Machine Learning?

"Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data" (thanks, Wikipedia)

Prediction with scikit-learn is easy - who will survive?


In [83]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score as acc

In [84]:
# create our classifier
clf = LogisticRegression()
# fit it to the data
clf.fit(X, y)
# and predict
preds = clf.predict(X)
res_acc = acc(y, preds)
print 'Accuracy Score: {:.2f}'.format(res_acc)
print 'Not too bad'


Accuracy Score: 0.80
Not too bad

Cross-validation gives a fairer performance estimate


In [85]:
from sklearn.cross_validation import KFold

In [86]:
cv = KFold(n=len(y), n_folds=5, shuffle=True)
preds = np.zeros_like(y)
for train, test in cv:
    clf = LogisticRegression()
    clf.fit(X.ix[train], y.ix[train])
    preds[test] = clf.predict(X.ix[test])
res_acc = acc(y, preds)
print 'Accuracy Score: {:.2f}'.format(res_acc)


Accuracy Score: 0.79

And cross-validation can be done more easily


In [87]:
# scikit-learn can actually take care of this for us
from sklearn.cross_validation import cross_val_score

clf = LogisticRegression()
scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')

print scores
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))


[ 0.78212291  0.77094972  0.78651685  0.75842697  0.81920904]
Accuracy: 0.78 (+/- 0.04)

Dealing with categorical data


In [88]:
df.Embarked.head()


Out[88]:
0    S
1    C
2    S
3    S
4    S
Name: Embarked, dtype: object

In [89]:
set(df.Embarked.fillna('O'))


Out[89]:
{'C', 'O', 'Q', 'S'}

Use the LabelEncoder


In [90]:
from sklearn import preprocessing
df.Embarked = df.Embarked.fillna('O')
le = preprocessing.LabelEncoder()
le.fit(df.Embarked.values)
le.classes_


Out[90]:
array(['C', 'O', 'Q', 'S'], dtype=object)

In [91]:
X['Embarked'] = le.transform(df.Embarked.values)
X.Embarked.head()


Out[91]:
0    3
1    0
2    3
3    3
4    3
Name: Embarked, dtype: int64
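
Note that LabelEncoder imposes an arbitrary ordering on the ports; pandas' own get_dummies one-hot encodes instead (a sketch; X2 is a hypothetical name, the label-encoded X above is what's used below):


In [ ]:
dummies = pd.get_dummies(df.Embarked, prefix='Embarked')
X2 = X.drop('Embarked', axis=1).join(dummies)
X2.head()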

Tuning classifier parameters


In [94]:
for C in [0.001, 0.01, 0.1, 1, 10, 100]:
    # C is LogisticRegression's inverse regularisation strength
    clf = LogisticRegression(C=C, penalty='l1')
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')    
    print("C: {:3.3f}\tAccuracy: {:.2f} (+/- {:.2f})"
          .format(C, scores.mean(), scores.std() * 2))


C: 0.001	Accuracy: 0.66 (+/- 0.08)
C: 0.010	Accuracy: 0.67 (+/- 0.08)
C: 0.100	Accuracy: 0.79 (+/- 0.04)
C: 1.000	Accuracy: 0.79 (+/- 0.04)
C: 10.000	Accuracy: 0.78 (+/- 0.04)
C: 100.000	Accuracy: 0.78 (+/- 0.03)
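
scikit-learn can also automate this kind of search; GridSearchCV tries each setting under cross-validation (a sketch using this era's sklearn.grid_search module):


In [ ]:
from sklearn.grid_search import GridSearchCV

grid = GridSearchCV(LogisticRegression(penalty='l1'),
                    param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 100]},
                    cv=5, scoring='accuracy')
grid.fit(X, y)
print grid.best_params_, grid.best_score_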

Comparing classifiers is easy


In [95]:
# standardise the data to zero mean, unit variance
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(X)

In [96]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.lda import LDA
from sklearn.qda import QDA

names = ["Nearest Neighbors", "Linear SVM", "RBF SVM", "Decision Tree",
         "Random Forest", "AdaBoost", "Naive Bayes", "LDA",
         "QDA", "Logistic Regression"]
classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    GaussianNB(),
    LDA(),
    QDA(),
    LogisticRegression(class_weight='auto')]

In [97]:
# fit each classifier and find the mean performance
res = []
for name, clf in zip(names, classifiers):
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    res.append(scores.mean())

In [98]:
import prettyplotlib as ppl
res = np.array(res)
names = np.array(names)
idx = np.argsort(res)[::-1]
fig, ax = plt.subplots(1, figsize=(14, 6))
ppl.bar(ax, np.arange(len(res)), res[idx], annotate=True,
        xticklabels=names[idx], grid='y')
plt.xticks(rotation=30)
_ = ax.set_ylim(res.min() * 0.95, res.max() * 1.05)


Models can be pickled


In [ ]:
# models can be saved
import pickle
s = pickle.dumps(clf)
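
... and loaded back to predict later (a sketch; the classifier needs to be fitted before pickling for predict to work):


In [ ]:
clf = LogisticRegression().fit(X, y)
clf2 = pickle.loads(pickle.dumps(clf))
clf2.predict(X[:5])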

And there is a whole lot more scikit-learn can do...

  • supervised learning
  • model evaluation
  • unsupervised learning
  • feature selection
  • feature extraction

by Andreas Mueller